N+1 Redundancy for Virtualized Services with Low Latency Fail-Over

ABSTRACT

Fail-over protection is provided for a server cluster including a plurality of primary nodes supporting user sessions and a standby node. When the standby node determines that a primary node in a cluster has failed, the standby node configures its network interface to use an Internet Protocol (IP) address of the failed primary node. The standby node further retrieves session data for user sessions supported by the failed primary node from a low latency database for the cluster and restores the user sessions at the standby node. When the user sessions are restored, the standby node switches from a standby mode to an active mode.

RELATED APPLICATIONS

This application claims priority to U.S. Provisional Application No. 62/779,313 filed 13 December 2018 and U.S. Provisional Application No. 62/770,550 filed 21 November 2018. The disclosures of each of these references are incorporated in their entireties by reference herein.

TECHNICAL FIELD

The present disclosure relates generally to failure protection for communication networks and, more particularly, to an N+1 redundancy scheme for virtualized services with low latency fail-over.

BACKGROUND

There are two main failure protection schemes for maintaining continuity of service in the event that a network node in a communication network responsible for handling user traffic fails. The two main protection schemes are 1+1 protection and N+1 protection. With 1+1 protection, N standby nodes are available for N network nodes to take over the function of a failed primary node or nodes. In 1+1 protection schemes, each network node has its own dedicated standby node, which can take over traffic currently being handled by its corresponding network node without loss of sessions. This is known as “hot standby.” One drawback of 1+1 protection is that it requires doubling of system resources. With N+1 protection, 1 standby node is available for N network nodes to take over the function of a single failed primary node. However, the N+1 redundancy scheme typically provides only “cold standby” protection, so that traffic handled by the failed network node is lost at a switchover. Existing N+1 solutions do not preserve the state of the failed primary node, resulting in tear-down of existing sessions. This is because the standby node is not dedicated to any specific one of the N primary nodes, and there has been no way to make the state of any one of the primary nodes available in the backup node after the failure. Ultimately, the only benefit is that capacity does not drop after a failure; ongoing sessions are not protected.

In the case of Virtual Router Redundancy Protocol (VRRP)-based solutions, a standby node may take over the Internet Protocol (IP) address of a failed primary node as well as the functions of the failed primary node, but these solutions do not take over the real-time state of the failed primary node that would be needed for preserving session continuity for sockets. Moreover, the operator of the network has to configure separate VRRP sessions with separate IP addresses for each VRRP relationship (i.e., the standby node needs a separate VRRP context for each primary node it is meant to protect). This configuration overhead makes the solution cumbersome in a larger cluster.

SUMMARY

The present disclosure comprises methods and apparatus for providing N+1 redundancy for a cluster of network nodes including a standby node and a plurality of primary nodes. When the standby node determines that a primary node in a cluster has failed, the standby node configures the standby node to use an IP address of the failed primary node. The standby node further retrieves session data for user sessions associated with the failed primary node from a low latency database for the cluster and restores the user sessions at the standby node. When the user sessions are restored, the standby node switches from a standby mode to an active mode.

A first aspect of the disclosure comprises methods of providing N+1 redundancy for a cluster of network nodes. In one embodiment, the method comprises determining, by a standby node, that a primary node in a cluster has failed, configuring the standby node to use an IP address of the failed primary node, retrieving session data for user sessions associated with the failed primary node from a low latency database for the cluster, restoring the user sessions at the standby node, and switching from a standby mode to an active mode.

A second aspect of the disclosure comprises a network node configured as a standby node to provide N+1 protection for a cluster of network nodes including the standby node and a plurality of primary nodes. The standby node comprises a network interface for communicating over a communication network and a processing circuit. The processing circuit is configured to determine that a primary node in a cluster has failed. Responsive to determining that a primary node has failed, the processing circuit configures the standby node to use an IP address of the failed primary node. The processing circuit is further configured to retrieve session data for user sessions associated with the failed primary node from a low latency database for the cluster and restore the user sessions at the standby node. After the user sessions are restored, the processing circuit switches the standby node from a standby mode to an active mode.

A third aspect of the disclosure comprises a computer program comprising executable instructions that, when executed by a processing circuit in a redundancy controller in a network node, cause the redundancy controller to perform the method according to the first aspect. A fourth aspect of the disclosure comprises a carrier containing a computer program according to the third aspect, wherein the carrier is one of an electronic signal, optical signal, radio signal, or non-transitory computer readable storage medium.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a server cluster with N+1 redundancy protection.

FIG. 2 graphically illustrates a fail-over.

FIG. 3 illustrates a fail-over procedure according to a first embodiment.

FIG. 4 illustrates a fail-over procedure according to a second embodiment.

FIG. 5 illustrates an exemplary fail-over method implemented by a standby node.

FIG. 6 illustrates an exemplary recovery method implemented by a primary node.

FIG. 7 illustrates an exemplary network node.

DETAILED DESCRIPTION

Referring now to the drawings, FIG. 1 illustrates a server cluster 10 with N+1 redundancy protection that implements a virtual network function (VNF), such as a media gateway (MGW) function or Border Gateway Function (BGF). The server cluster 10 can be used, for example, in a communication network, such as an Internet Protocol Multimedia Subsystem (IMS) network or other telecom network. The server cluster 10 comprises a plurality of network nodes, which include a plurality of primary nodes 12 for handling user sessions and a standby node 14 providing N+1 protection in the event that a primary node 12 fails. Each of the network nodes 12, 14 in the cluster 10 can be implemented by dedicated hardware and processing resources. Alternatively, the network nodes 12, 14 can be implemented as virtual machines (VMs) using shared hardware and processing resources.

User sessions (e.g., telephone calls, media streams, etc.) are distributed among the primary nodes 12 by a load balancing node 16. An orchestrator 18 manages the server cluster 10. A distributed, low-latency database 20 serves as a data store for the cluster 10 to store the states of the user sessions being handled by the primary nodes 12 as hereinafter described. An exemplary distributed database 20 is described in Németh, Gábor, Dániel Géhberger, and Péter Mátray, “DAL: A Locality-Optimizing Distributed Shared Memory System,” 9th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 17), Santa Clara, Calif., July 10-11, 2017.

The network nodes 12, 14 are part of the same subnet with a common IP prefix. Each user session is associated with a particular IP address, which identifies the primary node 12 that handles user traffic for the user session. State information for the user sessions is stored in a distributed, low-latency database 20 that serves the server cluster 10. In the event that a primary node 12 fails, the standby node 14 can retrieve the state information of user sessions handled by the failed primary node 12 and restore the “lost” user sessions so that service continuity is maintained for the user sessions.

FIG. 2 shows the server cluster 10 of FIG. 1 in simplified form to graphically illustrate the basic steps of a fail-over procedure. It is assumed in FIG. 2 that Primary Node 1 has failed. At 1, the failure is detected by the standby node 14 and the failed primary node 12 is identified. At 2, the standby node 14 retrieves the state information for Primary Node 1 from the database 20 and recreates the user sessions at the standby node 14. At 3, the standby node 14 takes over the IP address of Primary Node 1 and configures its network interface to use the IP address. At 4, the standby node 14 advertises the location change of the IP address. Thereafter, the traffic for user sessions associated with IP Address 1 will be routed to the standby node 14 rather than to Primary Node 1.

The failure protection scheme used in the present disclosure can be viewed as having three separate phases. In a first phase, referred to as the preparation phase, a redundant system is built so that the system is prepared for failure of a primary node 12. The second phase comprises a fail-over process in which the standby node 14, upon detecting a failure of a primary node 12, takes over the active user sessions handled by the failed primary node 12. After the fail-over process is complete, a post-failure process restores the capacity and redundancy that were lost by the failure of the primary node 12, so that backup protection is re-established to protect against future network node failures.

During the preparation phase, state information necessary to restore the user sessions is externalized and stored in the database 20 by each primary node 12. Conventional log-based approaches or checkpointing can be used to externalize the state information. Another suitable method of externalizing state information is described in co-pending application 62/770,550 titled “Fast Session Restoration of Latency Sensitive Middleboxes” filed on Nov. 21, 2018.

The data that needs to be stored in the database 20 depends on the application and the communication protocol used in the application. For Transmission Control Protocol (TCP) sessions, such state information may comprise port numbers, counters, sequence numbers, various data on TCP buffer windows, etc. Generally, all state information that is necessary to continue the user session should be stored externally.
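As an illustration only, the following Python sketch shows one way a primary node might serialize the state of a single TCP session before writing it to the database. The record layout, the field names, the `session/<ip>/...` key scheme, and the use of JSON are assumptions made for this example; the disclosure does not prescribe a format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class TcpSessionState:
    """Hypothetical record of the state needed to resume one TCP session."""
    local_ip: str     # IP address of the primary node serving the session
    local_port: int
    remote_ip: str
    remote_port: int
    snd_nxt: int      # next sequence number to send
    rcv_nxt: int      # next sequence number expected from the peer
    snd_wnd: int      # send window size
    rcv_wnd: int      # receive window size

    def key(self) -> str:
        # Assumed key scheme: prefixing with the primary node's IP address
        # lets a standby fetch every session of a failed node by prefix.
        return (f"session/{self.local_ip}/{self.local_port}/"
                f"{self.remote_ip}/{self.remote_port}")

    def to_bytes(self) -> bytes:
        return json.dumps(asdict(self)).encode("utf-8")

    @classmethod
    def from_bytes(cls, raw: bytes) -> "TcpSessionState":
        return cls(**json.loads(raw.decode("utf-8")))
```

A standby node that reads the stored bytes back can rebuild the record with `TcpSessionState.from_bytes(raw)`.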

In order to ensure that a backup is readily available to replace a primary node 12 that has failed, a “warm” standby node 14 is provisioned and made available to take over any user sessions for a failed primary node 12. During the provisioning, system checks are performed to ensure that:

-   the image of the standby node 14 is booted;
-   the operating system for the standby node 14 is up and running;
-   the standby node 14 has a live connection to the database 20; and
-   the standby node 14 shares the same configuration as the other instances and is connected to the same next-hop routers.

It is not known in advance which of the primary nodes 12 will fail. However, the standby node 14 is ready to fetch the necessary state information from the database 20 to take over for any one of the primary nodes 12. This standby mode is referred to herein as a “warm” standby.

The fail-over process is triggered by the failure of one of the primary nodes 12. In some embodiments, failure is detected by the standby node 14 based on “heartbeat” or “keepalive” signaling. In some embodiments, the primary nodes 12 may periodically transmit a “heartbeat” signal and a failure is detected when the heartbeat signal is not received by the standby node 14. In other embodiments, the standby node 14 may periodically transmit a “keepalive” signal or “ping” message to each of the primary nodes 12. In this case, a failure is detected when a primary node 12 fails to respond. This “keepalive” signaling process should be run continuously and pairwise between the standby node 14 and each primary node 12.
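The heartbeat variant can be sketched as follows. This is a minimal illustration and not the disclosed implementation: the class name, the timeout parameter, and the injectable clock are assumptions, and transport of the heartbeat messages is left out.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each primary node (hypothetical helper)."""

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_seen: Dict[str, float] = {}

    def register(self, node_ip: str) -> None:
        # Start tracking a primary node; registration counts as a first heartbeat.
        self.last_seen[node_ip] = self.clock()

    def on_heartbeat(self, node_ip: str) -> None:
        # Called whenever a heartbeat arrives from a primary node.
        self.last_seen[node_ip] = self.clock()

    def failed_nodes(self) -> List[str]:
        # A node is considered failed once no heartbeat has been seen
        # for longer than the configured timeout.
        now = self.clock()
        return [ip for ip, seen in self.last_seen.items()
                if now - seen > self.timeout_s]
```

The standby node would call `failed_nodes()` periodically and start the fail-over for any address it returns. The timeout trades detection latency against false positives caused by delayed heartbeats.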

In other embodiments, the failure of a primary node 12 can be detected by another network entity and communicated to the standby node 14 in the form of a failure notification. For example, one primary node 12 may detect the failure of another primary node 12 and send a failure notification to the standby node 14. In another embodiment, the database 20 may detect the failure of a primary node 12 and send a failure notification to the standby node 14.

When a fail-over is triggered, the standby node 14 retrieves the IP address or addresses of the failed primary node 12, as well as the session states (e.g., application and protocol dependent context data) necessary to re-initiate the user sessions at the standby node 14. In some embodiments, the standby node 14 writes the network identity (e.g., IP address) of the failed primary node 12 into a global key called STDBY_IDENTITY, which is stored in the database 20 so that all nodes in the server cluster 10 are aware that the standby node 14 has assumed the role of the failed primary node 12. Responsive to the failure detection or failure indication, the standby node 14 configures its network interface to use the IP address or addresses of the failed primary node 12 and loads the retrieved session states into its own tables. When the standby node 14 is ready to take over, the standby node 14 broadcasts a Gratuitous Address Resolution Protocol (GARP) message with its own Medium Access Control (MAC) address and its newly configured IP address(es), so that the routers in the subnet know to forward packets with the IP address(es) formerly used by the failed primary node 12 to the standby node's MAC address. The same general principles also apply to Internet Protocol version 6 (IPv6) interfaces, where an Unsolicited Neighbor Advertisement message is used in place of the GARP message.
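For illustration, the sketch below builds the bytes of a gratuitous ARP frame for an IPv4 address using only the Python standard library. The function name is hypothetical, and the choice of an ARP request with the sender and target IP both set to the announced address is an assumption for this example (it follows the common ARP announcement form); the disclosure only requires that the binding between the IP address and the MAC address be announced.

```python
import socket
import struct


def build_garp_frame(mac: bytes, ipv4: str) -> bytes:
    """Return an Ethernet frame carrying a gratuitous ARP for (mac, ipv4)."""
    if len(mac) != 6:
        raise ValueError("MAC address must be 6 bytes")
    ip = socket.inet_aton(ipv4)  # 4-byte network-order IPv4 address

    # Ethernet header: broadcast destination, our MAC as source, EtherType ARP.
    eth_header = b"\xff" * 6 + mac + struct.pack("!H", 0x0806)

    # ARP header: hardware type Ethernet (1), protocol type IPv4 (0x0800),
    # hardware length 6, protocol length 4, operation 1 (request).
    arp_header = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)

    # Sender MAC/IP are the standby node's MAC and the taken-over IP;
    # target MAC is zeroed and target IP repeats the announced address.
    arp_body = mac + ip + b"\x00" * 6 + ip
    return eth_header + arp_header + arp_body
```

Actually transmitting the frame requires a raw link-layer socket and elevated privileges, which are platform specific and therefore omitted here.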

During the post-failure phase, the original capacity of the server cluster 10 with N primary nodes 12 and 1 standby node 14 is restored. There are essentially two alternative approaches to restoring the system capacity.

In a first approach for the post-failure phase, the standby node 14 switches from a standby mode to an active mode and serves only temporarily as a primary node, reverting to a “warm” standby mode when done. The standby node 14 serves only the user sessions that were taken over from the failed primary node 12 and is not assigned to handle any new user sessions by the load balancing node 16. When the orchestrator 18 learns about the failure of a primary node 12, it establishes a new primary node 12 to replace the failed primary node 12 and restores system capacity according to a regular scale-out procedure. The orchestrator 18 should ensure that the IP addresses used by the failed primary node 12 on the user plane are reserved, because these addresses are taken over by the standby node 14. In the case of an OpenStack-based orchestrator 18, reserving the IP addresses means that the “ports” should not be deleted when the failed primary node 12 disappears. This requires, however, garbage collection. The standby node 14, when it terminates, sends a trigger to the orchestrator 18 indicating that the ports used by the affected IP addresses can be deleted. After this notification, the IP addresses can be assigned to new network nodes (e.g., VNFs).

During the post-failure phase, the operation of the load balancing node 16 needs to take into account the failed primary node 12. Immediately after the failure, the load balancing node 16 does not assign new incoming sessions to either the failed primary node 12 or the standby node 14. As noted above, the standby node 14 continues serving existing user sessions taken over from the failed primary node 12, but does not receive new sessions. After the last session is finished at the standby node 14, or upon expiration of a timer (MAX_STANDBY_LIFETIME), the standby node 14 erases or clears the STDBY_IDENTITY key in the database 20, sends a notification to the orchestrator 18 indicating that the IP addresses of the failed primary node 12 can be released, and transitions back to a “warm” standby mode. The MAX_STANDBY_LIFETIME timer, if used, is started when the standby node 14 takes over for the failed primary node 12.
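The temporary-takeover behavior can be pictured as a small state machine. The Python sketch below is an illustration under stated assumptions: the database is modeled as a plain dictionary, the class and method names are hypothetical, and the notification to the orchestrator is represented by a callback. Only the STDBY_IDENTITY key and the MAX_STANDBY_LIFETIME timer come from the description above.

```python
import time
from typing import Callable, Iterable, Optional, Set


class StandbyLifecycle:
    """Hypothetical model of a standby node that is only temporarily active."""

    def __init__(self, db: dict, max_lifetime_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.db = db                          # stand-in for the cluster database
        self.max_lifetime_s = max_lifetime_s  # MAX_STANDBY_LIFETIME
        self.clock = clock
        self.mode = "standby"
        self.sessions: Set[str] = set()
        self.deadline: Optional[float] = None

    def take_over(self, failed_ip: str, session_ids: Iterable[str]) -> None:
        # Publish the assumed identity, adopt the sessions, start the timer.
        self.db["STDBY_IDENTITY"] = failed_ip
        self.sessions = set(session_ids)
        self.deadline = self.clock() + self.max_lifetime_s
        self.mode = "active"

    def end_session(self, session_id: str) -> None:
        self.sessions.discard(session_id)

    def maybe_revert(self, notify_release: Callable[[str], None]) -> bool:
        # Revert once the last session has ended or the timer has expired.
        if self.mode != "active":
            return False
        expired = self.clock() >= self.deadline
        if self.sessions and not expired:
            return False
        failed_ip = self.db.pop("STDBY_IDENTITY")  # clear the identity key
        notify_release(failed_ip)                  # tell the orchestrator
        self.sessions.clear()
        self.deadline = None
        self.mode = "standby"
        return True
```

In this sketch `maybe_revert` would be called after each session ends and on a periodic tick, so that either condition triggers the return to warm standby.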

In a second approach for the post-failure phase, the standby node 14 permanently assumes the role of the failed primary node 12 and the orchestrator 18 re-establishes system capacity by initiating a new standby node 14. In this case, the standby node 14 sends a notification message or other indication to the orchestrator 18 indicating that the IP address(es) of the failed primary node 12 were assumed or taken over by the standby node 14, so that the orchestrator 18 knows (i) to which primary node 12 the IP addresses belong, and (ii) that these IP addresses cannot be used for new instances of the primary nodes 12 in case of a scale-out. The standby node 14 (now fully a primary node 12) triggers the orchestrator 18 to launch a new instance for the standby node 14 to restore the original redundancy protection.

There may be circumstances where a primary node 12 fails only temporarily, typically because of a VM reboot. Following the restart, the primary node 12 may try to use its earlier IP address(es), which would cause a conflict with the standby node 14 that is serving the ongoing user sessions associated with those addresses. To avoid this conflict, before resuming operation after the restart, the primary node 12 reads the STDBY_IDENTITY key in the database 20. If the STDBY_IDENTITY key matches the identity of the primary node 12, the primary node 12 either pauses and waits until the key is erased, indicating that the IP address used by the standby node 14 has been released, or asks for new configuration parameters from the orchestrator 18.

FIG. 3 illustrates an exemplary fail-over procedure used in some embodiments of the present disclosure. When the standby node 14 detects a node failure or receives a failure notification (step 1), it writes the network identity (e.g., IP address) of the failed primary node 12 into the global key STDBY_IDENTITY stored in the database 20 (step 2). The standby node 14 sends a GET message to the database 20 to request session information for the failed primary node 12 (step 3). In response to the GET message, the database 20 sends the session data for the failed primary node 12 to the standby node 14 (step 4). As previously described, the standby node 14 configures its network interface to use the IP address of the failed primary node 12 and broadcasts a GARP message to the network (step 5). Upon broadcast of the GARP message, the routers in the network will route messages previously sent to the failed primary node 12 to the standby node 14 and the standby node 14 will handle the user sessions of the failed primary node 12. When the load balancing node 16 is notified of the failure of a primary node 12, the load balancing node 16 removes the primary node 12 from its list of primary nodes 12 so that no new sessions will be assigned to the failed primary node 12 (step 6). Also, when the orchestrator 18 is notified of the failure of a primary node 12, the orchestrator 18 instantiates a new instance of the primary node 12 to replace the failed primary node 12 (step 7).

In the embodiment shown in FIG. 3, it is assumed that the standby node 14 is only temporarily active and reverts to a standby mode when a standby timer expires or after the last session assumed by the standby node 14 ends. In this case, when the standby timer expires (step 8), or when the last user session ends, the standby node 14 sends a release notification message to the orchestrator 18 to release the IP address assumed by the standby node 14, so that the IP address is available for reassignment (step 9). The standby node 14 also clears the standby identity key stored in the database 20 (step 10).

FIG. 4 illustrates another exemplary fail-over procedure used in some embodiments of the present disclosure where the standby node 14 permanently replaces the failed primary node 12. Steps 1-6 are the same as in the fail-over procedure shown in FIG. 3. After becoming active, the standby node 14 sends a notification message to the orchestrator 18 and/or load balancing node 16 to notify the orchestrator 18 and/or load balancing node 16 that it has taken over the IP address of the failed primary node 12 (step 7). The orchestrator 18 then instantiates a new instance of a standby node 14 to replace the previous standby node 14 (step 8). In some embodiments, the orchestrator 18 may notify the load balancing node 16 that the standby node 14 is now designated as a primary node 12. The load balancing node 16 adds the standby node 14 to its list of available primary nodes 12 in response to the notification from the standby node 14 or orchestrator 18 (step 9).

FIG. 5 illustrates an exemplary method 100 implemented by a standby node 14 in a server cluster 10 including a plurality of primary nodes 12. When the standby node 14 determines that a primary node 12 in a cluster 10 has failed (block 110), the standby node 14 configures its network interface to use an IP address of the failed primary node 12 (block 120). The standby node 14 further retrieves, from a low latency database for the cluster, session data for user sessions associated with the failed primary node 12 (block 130) and restores the user sessions at the standby node 14 (block 140). When the user sessions are restored, the standby node 14 switches from a standby mode to an active mode (block 150).
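The sequence of blocks 120-150 can be sketched in Python as follows. This is a hedged illustration, not the claimed implementation: the database is modeled as a dictionary, the `session/<ip>/` key prefix is an assumed layout, and interface configuration, session restoration, and the address announcement are passed in as hypothetical callbacks. The announcement is placed after restoration here so that traffic arrives only once the sessions are ready, which is one possible ordering.

```python
from typing import Callable, Dict


def fail_over(
    failed_ip: str,
    db: Dict[str, object],
    configure_ip: Callable[[str], None],
    restore_session: Callable[[str, object], None],
    announce: Callable[[str], None],
) -> str:
    """Run the standby-side fail-over for one failed primary node."""
    # Publish which identity the standby has assumed.
    db["STDBY_IDENTITY"] = failed_ip

    # Block 120: configure the network interface with the failed node's IP.
    configure_ip(failed_ip)

    # Block 130: retrieve the session data stored for the failed node.
    prefix = f"session/{failed_ip}/"
    sessions = {k: v for k, v in db.items() if k.startswith(prefix)}

    # Block 140: restore each user session locally.
    for key, state in sessions.items():
        restore_session(key, state)

    # Announce the new IP-to-MAC binding once the sessions are in place.
    announce(failed_ip)

    # Block 150: the caller switches to the returned mode.
    return "active"
```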

In some embodiments of the method 100, determining that a primary node 12 in a cluster 10 has failed comprises sending a periodic keepalive message to one or more primary nodes 12 in the cluster 10, and determining a node failure when the failed primary node 12 fails to respond to a keepalive message.

In some embodiments of the method 100, determining that a primary node 12 in a cluster has failed comprises receiving a failure notification. As an example, the failure notification can be received from the database 20.

In some embodiments of the method 100, configuring the standby node 14 to use an IP address of the failed primary node 12 comprises configuring a network interface to use the IP address of the failed primary node 12.

In some embodiments of the method 100, configuring the standby node 14 to use an IP address of the failed primary node 12 further comprises announcing a binding between the IP address and a MAC address of the standby node 14.

Some embodiments of the method 100 further comprise setting a standby identity key in the database to an identity of the failed primary node 12.

Some embodiments of the method 100 further comprise, after a last one of the user sessions ends, releasing the IP address of the failed primary node 12 and switching from the active mode to the standby mode.

Some embodiments of the method 100 further comprise, after a last one of the user sessions ends, clearing the standby identity key in the database 20.

Some embodiments of the method 100 further comprise notifying an orchestrator 18 that the standby node 14 has replaced the failed primary node 12 and receiving new user sessions from a load-balancing node 16.

FIG. 6 illustrates an exemplary method 200 of failure recovery implemented by a primary node 12 in a cluster 10 of network nodes following a temporary failure of the primary node 12. Following a restart by the primary node 12, the primary node 12 determines whether an IP address of the primary node 12 is being used by a standby node 14 in the cluster 10 of network nodes (block 210). Upon determining that the IP address is being used by a standby node 14, the primary node 12 obtains a new IP address or waits for the IP address to be released by the standby node 14 (block 220). In the former case, the primary node 12 reconfigures its network interface with the new IP address and returns to an active mode (blocks 230, 250). In the latter case, the primary node 12 detects release of the IP address by the standby node 14 (block 240) and, responsive to such detection, returns to an active mode (block 250).

In one embodiment of the method 200, the primary node 12 determines whether an IP address of the primary node 12 is being used by a standby node 14 in the cluster 10 of network nodes by getting a standby identity from a database 20 serving the cluster 10 of network nodes, and comparing the standby identity to an identity of the primary node 12.

In another embodiment of the method 200, the primary node 12 determines when the IP address is released by monitoring the standby identity stored in the database 20 and determining that the IP address is released when the standby identity is cleared or erased.
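A minimal sketch of the recovery logic of method 200 is given below, assuming a dictionary as a stand-in for the database 20 and simple polling to detect that the standby identity has been cleared. The function name, the optional `request_new_ip` callback (representing a request to the orchestrator), and the polling interval are assumptions made for this example.

```python
import time
from typing import Callable, Optional


def recover_after_restart(
    db: dict,
    own_ip: str,
    request_new_ip: Optional[Callable[[], str]] = None,
    poll_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return the IP address the restarted primary node should use."""
    # Block 210: compare the standby identity with this node's identity.
    if db.get("STDBY_IDENTITY") != own_ip:
        return own_ip  # address is not in use by the standby node

    # Blocks 220/230: obtain a new address if that option is available.
    if request_new_ip is not None:
        return request_new_ip()

    # Block 240: otherwise wait until the standby identity is cleared.
    while db.get("STDBY_IDENTITY") == own_ip:
        sleep(poll_s)
    return own_ip
```

After the function returns, the node would configure its interface with the returned address and go back to active mode (block 250). A database that supports change notifications could replace the polling loop.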

FIG. 7 illustrates an exemplary network node 30 according to an embodiment. The network node 30 can be configured as a primary node 12 or as a standby node 14. The network node 30 includes a network interface 32 for sending and receiving messages over a communication network, a processing circuit 34, and memory 36. The processing circuit 34 may comprise one or more microcontrollers, microprocessors, hardware circuits, firmware, or a combination thereof. Memory 36 comprises both volatile and non-volatile memory for storing computer program code and data needed by the processing circuit 34 for operation. Memory 36 may comprise any tangible, non-transitory computer-readable storage medium for storing data including electronic, magnetic, optical, electromagnetic, or semiconductor data storage. Memory 36 stores a computer program 38 comprising executable instructions that configure the processing circuit 34 to implement the procedures and methods as herein described, including one or more of the methods 100, 200 shown in FIGS. 5 and 6. A computer program 38 in this regard may comprise one or more code modules corresponding to the means or units described above. In general, computer program instructions and configuration information are stored in a non-volatile memory, such as a ROM, erasable programmable read only memory (EPROM) or flash memory. Temporary data generated during operation may be stored in a volatile memory, such as a random access memory (RAM). In some embodiments, computer program 38 for configuring the processing circuit 34 as herein described may be stored in a removable memory, such as a portable compact disc, portable digital video disc, or other removable media. The computer program 38 may also be embodied in a carrier such as an electronic signal, optical signal, radio signal, or computer readable storage medium. In some embodiments, memory 36 stores virtualization code executed by the processing circuit 34 for implementing the network node 30 as a virtual machine.

Those skilled in the art will also appreciate that embodiments herein further include corresponding computer programs. A computer program comprises instructions which, when executed on at least one processor of an apparatus, cause the apparatus to carry out any of the respective processing described above. A computer program in this regard may comprise one or more code modules corresponding to the means or units described above.

Embodiments further include a carrier containing such a computer program. This carrier may comprise one of an electronic signal, optical signal, radio signal, or computer readable storage medium.

In this regard, embodiments herein also include a computer program product stored on a non-transitory computer readable (storage or recording) medium and comprising instructions that, when executed by a processor of an apparatus, cause the apparatus to perform as described above.

Embodiments further include a computer program product comprising program code portions for performing the steps of any of the embodiments herein when the computer program product is executed by a computing device. This computer program product may be stored on a computer readable recording medium.

The methods and apparatus herein described provide N+1 redundancy for a cluster of network nodes including a standby node and a plurality of primary nodes. When a primary node in a cluster has failed, the user sessions can be restored at the standby node. When the user sessions are restored, the standby node switches from a standby mode to an active mode.

The above description of illustrated implementations is not intended to be exhaustive or to limit the scope of the disclosure to the precise forms disclosed. While specific implementations and examples are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the present disclosure, as those skilled in the relevant art will recognize. The words “example” or “exemplary” are used herein to mean serving as an example, instance, or illustration. Any aspect or design described herein as “example” or “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects or designs.

1-43. (canceled)
44. A method of providing N+1 redundancy for a cluster of network nodes, the method comprising: determining, by a standby node, that a primary node in a cluster has failed; configuring the standby node to use an Internet Protocol (IP) address of the failed primary node; retrieving, from a low latency database for the cluster, session data for user sessions associated with the failed primary node; restoring the user sessions at the standby node; and switching from a standby mode to an active mode.
45. The method of claim 44, wherein determining that a primary node in a cluster has failed comprises: sending a periodic keepalive message to one or more primary nodes in the cluster; and determining a node failure when the failed primary node fails to respond to a keepalive message.
46. The method of claim 44, wherein configuring the standby node to use an IP address of the failed primary node comprises configuring a network interface to use the IP address of the failed primary node.
47. The method of claim 46, wherein configuring the standby node to use an IP address of the failed primary node comprises announcing a binding between the IP address and a Medium Access Control (MAC) address of the standby node.
48. The method of claim 44, further comprising setting a standby identity key in the database to an identity of the failed primary node.
49. The method of claim 44, further comprising, after a last one of the user sessions ends: releasing the IP address of the failed primary node; and switching from the active mode to the standby mode.
50. The method of claim 44, further comprising, upon expiration of a standby timer: releasing the IP address of the failed primary node; and switching from the active mode to the standby mode.
51. A network node providing N+1 protection for a plurality of primary nodes in a cluster, the network node comprising: a network interface configured to connect the network node to a communication network; and processing circuitry configured to: determine that one of the primary nodes in the cluster has failed; configure the standby node to use an Internet Protocol (IP) address of the failed primary node; retrieve, from a low latency database for the cluster, session data for user sessions associated with the failed primary node; restore the user sessions at the standby node; and switch from a standby mode to an active mode.
52. The network node of claim 51, wherein the processing circuitry is configured to: send a periodic keepalive message to one or more primary nodes in the cluster; and determine a node failure when the failed primary node fails to respond to a keepalive message.
53. The network node of claim 51, wherein the processing circuitry is configured to configure a network interface to use the IP address of the failed primary node.
54. The network node of claim 53, wherein the processing circuitry is configured to announce a binding between the IP address and a Medium Access Control (MAC) address of the standby node.
55. The network node of claim 51, wherein the processing circuitry is configured to set a standby identity key in the database to an identity of the failed primary node.
56. The network node of claim 51, wherein the processing circuitry is further configured to, after a last one of the user sessions ends: release the IP address of the failed primary node; and switch from the active mode to the standby mode.
57. The network node of claim 51, wherein the processing circuitry is further configured to, upon expiration of a standby timer: release the IP address of the failed primary node; and switch from the active mode to the standby mode.
58. A method of failure recovery by a primary node in a cluster, the method comprising the primary node: following a restart by the primary node, determining whether an IP address of the primary node is being used by a standby node in the cluster; and upon determining that the IP address is being used by a standby node, obtaining a new IP address or waiting for the IP address to be released by the standby node.
59. The method of claim 58, wherein determining whether an IP address of the primary node is being used by a standby node in the cluster of network nodes comprises: getting a standby identity from a database serving the cluster of network nodes; and comparing the standby identity to an identity of the primary node.
60. The method of claim 58, further comprising reconfiguring a network interface of the primary node with the new IP address and returning to an active mode.
61. A network node in a cluster of network nodes, the network node comprising: a network interface configured to connect the network node to a communication network; and processing circuitry configured to: following a restart by the network node, determine whether an IP address of the network node is being used by a standby node in the cluster; and upon determining that the IP address is being used by a standby node, obtain a new IP address or wait for the IP address to be released by the standby node.
62. The network node of claim 61, wherein the processing circuitry is configured to determine whether an IP address of the network node is being used by a standby node in the cluster of network nodes by: getting a standby identity from a database serving the cluster; and comparing the standby identity to an identity of the network node.
63. The network node of claim 61, wherein the processing circuitry is configured to reconfigure the network interface of the network node with the new IP address and return to an active mode.